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Abstract 


Bayesian  inference  has  become  increasingly  important  in  statistical  machine 
learning.  Exact  Bayesian  calculations  are  often  not  feasible  in  practice,  however. 
A  number  of  approximate  Bayesian  methods  have  been  proposed  to  make  such 
calculations  practical,  among  them  the  variational  Bayesian  (VB)  approach.  The 
VB  approach,  while  useful,  can  nevertheless  suffer  from  slow  convergence  to  the 
approximate  solution.  To  address  this  problem,  we  propose  Parameter-eXpanded 
Variational  Bayesian  (PX-VB)  methods  to  speed  up  VB.  The  new  algorithm  is  in¬ 
spired  by  parameter-expanded  expectation  maximization  (PX-EM)  and  parameter- 
expanded  data  augmentation  (PX-DA).  Similar  to  PX-EM  and  -DA,  PX-VB  ex¬ 
pands  a  model  with  auxiliary  variables  to  reduce  the  coupling  between  variables 
in  the  original  model.  We  analyze  the  convergence  rates  of  VB  and  PX-VB  and 
demonstrate  the  superior  convergence  rates  of  PX-VB  in  variational  probit  regres¬ 
sion  and  automatic  relevance  determination. 


1  Introduction 

A  number  of  approximate  Bayesian  methods  have  been  proposed  to  offset  the  high  computational 
cost  of  exact  Bayesian  calculations.  Variational  Bayes  (VB)  is  one  popular  method  of  approxima¬ 
tion.  Given  a  target  probability  distribution,  variational  Bayesian  methods  approximate  the  target 
distribution  with  a  factored  distribution.  While  factoring  omits  dependencies  present  in  the  target 
distribution,  the  parameters  of  the  factored  approximation  can  be  adjusted  to  improve  the  match. 
Specifically,  the  approximation  is  optimized  by  minimizing  the  KL-divergence  between  the  factored 
distribution  and  the  target.  This  minimization  can  be  often  carried  out  iteratively,  one  component 
update  at  a  time,  despite  the  fact  that  the  target  distribution  may  not  lend  itself  to  exact  Bayesian 
calculations.  Variational  Bayesian  approximations  have  been  widely  used  in  Bayesian  learning  (e.g., 
(Jordan  et  al.,  1998;  Beal,  2003;  Bishop  &  Tipping,  2000)). 

Variational  Bayesian  methods  nevertheless  suffer  from  slow  convergence  when  the  variables  in  the 
factored  approximation  are  actually  strongly  coupled  in  the  original  model.  The  same  problem  arises 
in  popular  Gibbs  sampling  algorithm.  The  sampling  process  converges  slowly  in  cases  where  the 
variables  are  strongly  correlated.  The  slow  convergence  can  be  alleviated  by  data  augmentation  (van 
Dyk  &  Meng,  2001;  Liu  &  Wu,  1999),  where  the  idea  is  to  identify  an  optimal  reparameterization 
(within  a  family  of  possible  reparameterizations)  so  as  to  remove  coupling.  Similarly,  in  a  deter¬ 
ministic  context,  Liu  et  al.  (1998)  proposed  over-parameterization  of  the  model  to  speed  up  EM 
convergence.  Our  work  here  is  inspired  by  DA  sampling  and  PX-EM.  Our  approach  uses  auxiliary 
parameters  to  speed  up  the  deterministic  approximation  of  the  target  distribution. 

Specifically,  we  propose  Parameter-eXpanded  Variational  Bayesian  (PX-VB)  method.  The  original 
model  is  modified  by  auxiliary  parameters  that  are  optimized  in  conjunction  with  the  variational 
approximation.  The  optimization  of  the  auxiliary  parameters  corresponds  to  a  parameterized  joint 
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optimization  of  the  variational  components;  the  role  of  the  new  updates  is  to  precisely  remove  oth¬ 
erwise  strong  functional  couplings  between  the  components  thereby  facilitating  fast  convergence. 


2  An  illustrative  example 

Consider  a  toy  Bayesian  model,  which  has  been  considered  by  Liu  and  Wu  (1999)  for  sampling. 

p(y\w,z)=Af(y\w  +  z,l),  p(z)  =  N{z  |  0,  D)  (1) 

where  D  is  a  know  hyperparameter  and  p(w)  oc  1.  The  task  is  to  compute  the  posterior  dis¬ 
tribution  of  w.  Suppose  we  use  a  VB  method  to  approximate  p(w\y ),  p(z\y)  and  p(w,z\y )  by 
q(w),  q(z)  and  q(w,  z)  =  q(w)q(z),  respectively.  The  approximation  is  optimized  by  minimizing 
K L(q(w)q(z)\\p(y\iu ,  z)p(z))  (the  second  argument  need  not  be  normalized).  The  general  forms  of 
the  component  updates  are  given  by 

q(w)  ex  exp((ln  p(y\w,z)p(z))q^)  (2) 

q{z)  ex  exp((lnp(y\w,z)p(z))q{w))  (3) 

It  is  easy  to  derive  the  updates  in  this  case: 

q(w)  =Af(w\y- (z),l)  q(z)  =  U{z\  ^  ,  -  +  ^  )  (4) 

Now  let  us  analyze  the  convergence  of  the  mean  parameter  of  g(w),  (w)  =  y  —  (z).  Iteratively, 

H  =  =  D-'ii  1  +  D~1)~1y  +  (1  +  D-')-*y  +  •••)=»■ 

The  variational  estimate  ( w )  converges  to  y,  which  actually  is  the  true  posterior  mean  (For  this  toy 
problem,  p(w\ y)  =  Af(w\y,  1  +  D)).  Furthermore,  if  D  is  large,  (w)  converges  slowly.  Note  that  the 
variance  parameter  of  q( w)  converges  to  1  in  one  iteration,  though  underestimates  the  true  posterior 
variance  1  +  D. 

Intuitively,  the  convergence  speed  of  (w)  and  q(w)  suffers  from  strong  coupling  between  the  updates 
of  w  and  z.  In  other  words,  the  update  information  has  to  go  through  a  feedback  loop  w  — >  2  — > 
w  ■  ■  ■ .  To  alleviate  the  coupling,  we  expand  the  original  model  with  an  additional  parameter  a : 

p(y\w,z)=Af(y\w  +  z,l)  p(z\a)  =  Af(z  \  a,  D)  (5) 

The  expanded  model  reduces  to  the  original  one  when  a  equals  the  null  value  ao  =  0. 

Now  having  computed  q(z)  given  a  =  0,  we  minimize  KL(q(w)q(z)\\p(y\w,  z)p(z\a))  over  a  and 
obtain  the  minimizer  a  =  (z).  Then,  we  reduce  the  expanded  model  to  the  original  one  by  applying 
the  reduction  rule 

znew  =  z  —  a  =  z  —  (z),  waew  =  w  +  a  =  w  +  (z). 

Correspondingly,  we  change  the  measures  of  q(w)  and  q(z): 

q(w  +  (z))  -  ?(wnew)  =  Af(w™\y,  1)  q(z  -  (z))  q(z new)  =  AA(z“w|0,  1+1£>_1)  (6) 

Thus,  the  PX-VB  method  converges.  Here  a  breaks  the  update  loop  between  q(w)  and  q(z)  and 
plays  the  role  of  a  correction  force;  it  corrects  the  update  trajectories  of  q(w)  and  q(z)  and  makes 
them  point  directly  to  the  convergence  point. 

3  The  PX-VB  Algorithm 

In  the  general  PX-VB  formulation,  we  over-parameterize  the  model  p(x,  D)  to  get  pa  (x,  D),  where 
the  original  model  is  recovered  for  some  default  values  of  the  auxiliary  parameters  a  =  a o.  The 
algorithm  consists  of  the  typical  VB  updates  relative  to  pa(x,D),  the  optimization  of  auxiliary 
parameters  a,  as  well  as  a  reduction  step  to  turn  the  model  back  to  the  original  form  where  a  =  a q. 
This  last  reduction  step  has  the  effect  of  jointly  modifying  the  components  of  the  factored  variational 
approximation.  Put  another  way,  we  push  the  change  in  pa(x,  D),  due  to  the  optimization  of  a , 
into  the  variational  approximation  instead.  Changing  the  variational  approximation  in  this  manner 
permits  us  to  return  the  model  into  its  original  form  and  set  a  =  op. 

Specifically,  we  first  expand  p(x,  D)  to  obtain  p„  (x.  D).  Then  at  the  tth  iteration. 


1.  q(xs)  are  updated  sequentially.  Note  that  the  approximate  distribution  q(x)  =  J^s  q(xs). 

2.  We  minimize  K  L(q(x)\\pa(x,  D))  over  the  auxiliary  parameters  a..  This  optimization  can 
be  done  jointly  with  some  components  of  the  variational  distribution,  if  feasible. 

3.  The  expanded  model  is  reduced  to  the  original  model  through  reparameterization.  Accord¬ 
ingly,  we  change  g^+1^(x)  to  g^+1)(x)  such  that 

KL(q^t+1\x)\\pao(x,D))  =  KL(q(x)  ||pa(*+i)  (x,  D))  (7) 

where  r/t+1 1  (x)  are  the  modified  components  of  the  variational  approximation. 

4.  Set  a  =  a o- 

Since  each  update  of  PX-VB  decreases  or  maintains  the  KL  divergence  KL(q(x)  |p(x,  D )),  which 
is  lower  bounded,  PX-VB  reaches  a  stationary  point  for  K L(q(x)\\p(x,  !))).  Empirically,  PX-VB 
often  achieves  solution  similar  to  what  VB  achieves,  with  faster  convergence. 

A  simple  strategy  to  implement  PX-VB  is  to  use  a  mapping  Sa,  parameterized  by  a,  over  the  vari¬ 
ables  x.  After  sequentially  optimizing  over  the  components  {</(xs)},  we  maximize  (lnpa(x))g(x) 
over  a.  Then,  we  reduce  pa(x,  D)  to  p(x,  D)  and  q(x)  to  q(x)  through  the  inverse  mapping  of  Sa, 
Ma  =  S~r.  Since  we  optimize  a  after  optimizing  {g(xs},  the  mapping  S  should  change  at  least 
two  components  of  x.  Otherwise,  the  optimization  over  a.  will  do  nothing  since  we  have  already 
optimized  over  each  q(xs).  If  we  jointly  optimize  a  and  one  component  q(xs),  it  suffices  (albeit 
need  not  be  optimal)  for  the  mapping  Sa  to  change  only  q(xs). 

Algorithmically,  PX-VB  bears  a  strong  similarity  to  PX-EM  (Liu  et  al.,  1998).  They  both  expand  the 
original  model  and  both  are  based  on  lower  bounding  KL-divergence.  However,  the  key  difference 
is  that  the  reduction  step  in  PX-VB  changes  the  lower-bounding  distributions  {</(xs)|,  while  in  PX- 
EM  the  reduction  step  is  performed  only  for  the  parameters  in  p(x.  D).  We  also  note  that  the  PX-VB 
reduction  step  via  Ma  leaves  the  KL-divergence  (lower  bound  on  the  likelihood)  invariant,  while  in 
PX-EM  the  likelihood  of  the  observed  data  remains  the  same  after  the  reduction.  Because  of  these 
differences,  general  EM  acceleration  methods  (e.g.,  (Salakhutdinov  et  al.,  2003))  can  not  be  directly 
applied  to  speed  up  VB  convergence. 

In  the  following  sections,  we  present  PX-VB  methods  for  two  popular  Bayesian  models:  Probit  re¬ 
gression  for  data  classification  and  Automatic  Relevance  Determination  (ARD)  for  feature  selection 
and  sparse  learner. 

3.1  Bayesian  Probit  regression 

Probit  regression  is  a  standard  classification  technique  (see,  e.g.,  (Liu  et  al.,  1998)  for  the  maximum 
likelihood  estimation).  Here  we  demonstrate  the  use  of  variational  Bayesian  methods  to  train  Probit 
models. 

The  data  likelihood  for  Probit  regression  is 

p(t|X,w)  =  J]V(inwTx„), 

n 

where  X  =  [xl5 . . . ,  x.y]  and  a  is  the  standard  normal  cumulative  distribution  function.  We  can 


rewrite  the  likelihood  in  an  equivalent  form 

p(tn\zn)  =  sign  (tnzn)  p(zn  |w,x„)  =  7V(2:„|wtx„,  1)  (8) 

Given  a  Gaussian  prior  over  the  parameter,  p(w)  =  7V"(w|0,  uol),  we  wish  to  approx¬ 
imate  the  posterior  distribution  p(w,z|X,t)  by  q( w,z)  =  q(w)  q(zn).  Minimizing 

KL(q(w)  Jj[n  q(zn)\\p('w ,  z,t|X)),  we  obtain  the  following  VB  updates: 

q{z„)  =  TAf(zn\(w)Txn,l,tnzn)  (9) 

q{ w)  =  A/’(w|(XXt  +  t^1I)'1X(z),  (XXT  +  t^I)"1)  (10) 

where  TAf(zn\{w)Txn:  1,  tnzn)  stands  for  a  truncated  Gaussian  such  that 


TJ\f(zn\(w)Txn,  1,  tnzn)  =  Af(zn\ (w)Tx„,  1)  when  tnzn  >  0,  and  it  equals  0  otherwise. 


To  speed  up  the  convergence  of  the  above  iterative  updates,  we  apply  the  PX-VB  method.  First,  we 
expand  the  orginal  model  p( w,  z,  t|X)  to  pc( w,  z,  t|X)  with  the  mapping 

w  =  wc  z  =  zc  (11) 


such  that 

PcOn|w,x„)  =  Af(zn\wTxn,c2)  p(w)  =  J\f(w\0,c2v0I)  (12) 

Setting  c  =  Co  =  1  in  the  expanded  model,  we  update  q(zn)  and  q( w)  as  before,  via  (9)  and  (10). 
Then,  we  minimize  KL(q(z)q(w)\\pc(w,  z,  t|X))  over  c,  yielding 

c2  =  N  +  -2(^n)(w)Tx„  +  x^(wwT)xra)  +r0~1(wwT))  (13) 

n 

where  M  is  the  dimension  of  w.  In  the  degenerate  case  where  vq  =  oo,  the  denominator  of  the 
above  equation  becomes  N  instead  of  N  +  M.  Since  this  equation  can  be  efficiently  calculated,  the 
extra  computational  cost  induced  by  the  auxiliary  variable  is  therefore  small.  We  omit  the  details. 


The  transformation  back  to  pCo  can  be  made  via  the  inverse  map 

w  =  w/c  z  =  z/c.  (14) 


Accordingly,  we  change  q( w)  to  obtain  a  new  posterior  approximation  gc(w): 

5c( w)  =  ■A/’(w|(XXT  +  u0-1I)-1X(z)/c,  (XXT  +  v^Tf1 /<?)  (15) 

We  do  not  actually  need  to  compute  qc{zn)  if  this  component  will  be  optimized  next. 

By  changing  variables  w  to  w  through  (14),  the  KL  divergence  between  the  approximate  and  exact 
posteriors  remains  the  same.  After  obtaining  new  approximations  qc( w)  and  q(zn),  we  reset  c  = 
Co  =  1  for  the  next  iteration. 

Though  similar  to  the  PX-EM  updates  for  the  Probit  regression  problem  (Liu  et  al.,  1998),  the  PX- 
VB  updates  are  geared  towards  providing  an  approximate  posterior  distribution. 

We  use  both  synthetic  data  and  a  kidney  biopsy  data  (van  Dyk  &  Meng,  2001)  as  numerical  examples 
for  probit  regression.  We  set  vq  =  oo  in  the  experiment.  The  comparison  of  convergence  speeds  for 
VB  and  PXVB  is  illustrated  in  figure  1 . 


Figure  1:  Comparison  between  VB  and  PX-VB  for  probit  regression  on  synthetic  (a)  and  kidney- 
biospy  data  sets  (b).  PX-VB  converges  significantly  faster  than  VB.  Note  that  the  Y  axis  shows  the 
difference  between  two  consecutive  estimates  of  the  posterior  mean  of  the  parameter  w. 

For  the  synthetic  data,  we  randomly  sample  a  classifier  and  use  it  to  define  the  data  labels  for  sampled 
inputs.  We  have  100  training  and  500  test  data  points,  each  of  which  is  20  features.  The  kidney 
data  set  has  55  data  points,  each  of  which  is  a  3  dimensional  vector.  On  the  synthetic  data,  PX- 
VB  converges  immediately  while  VB  updates  are  slow  to  converge.  Both  PX-VB  and  VB  trained 
classifiers  achieve  zero  test  error.  On  the  kidney  biopsy  data  set,  PX-VB  converges  in  507  iterations, 
while  VB  converges  in  7518  iterations.  In  other  words,  PX-VB  requires  15  times  fewer  iterations 
than  VB.  In  terms  of  CPU  time,  which  reflects  the  extra  computational  cost  induced  by  the  auxiliary 
variables,  PX-VB  is  14  times  more  efficient.  Among  all  these  runs,  PX-VB  and  VB  achieve  very 
similar  estimates  of  the  model  parameters  and  the  same  prediction  results.  In  sum,  with  a  simple 
modification  of  VB  updates,  we  significantly  improve  the  convergence  speed  of  variational  Bayesian 
estimation  for  probit  model. 


3.2  Automatic  Relevance  Determination 


Automatic  relevance  determination  (ARD)  is  a  powerful  Bayesian  sparse  learning  technique 
(MacKay,  1992;  Tipping,  2000;  Bishop  &  Tipping,  2000).  Here,  we  focus  on  variational  ARD 
proposed  by  Bishop  and  Tipping  (2000)  for  sparse  Bayesian  regression  and  classification. 

The  likelihood  for  ARD  regression  is 

p(t|X,w,r)  =  n^(f„|wT0„,T-1) 

n 

where  <pn  is  a  feature  vector  based  on  xn,  such  as  [fc(xi,x„), . . . ,  [&(x jv,  xra)]T  where  fc(x.j,  xj) 
is  a  nonlinear  basis  function.  For  example,  we  can  choose  a  radial  basis  function  k(x,,  x:/)  = 
exp(—  ||xj  —  Xj||/(2A2),  where  A  is  the  kernel  width. 

In  ARD,  we  assign  a  Gaussian  prior  on  the  model  parameters  w:  p(w|a)  = 
where  the  inverse  variance  diag(ct)  follows  a  factorized  Gamma  distribution: 

p(ac )  =  J^Gamma(am|a,  b)  =  ba a^1  e~bam /V {a)  (16) 

m  m 

where  a  and  b  are  hyperparameters  of  the  model.  The  posterior  does  not  have  a  closed  form.  Let 
us  approximate  p(w,  a,r|X,t)  by  a  factorized  distribution  g(w,  a,r)  =  g(w)g(a)g(r).  The 
sequential  VB  updates  on  q(r),  q(w)  and  q(at)  are  described  by  Bishop  and  Tipping  (2000). 

The  variational  RVM  achieves  good  generalization  performance  as  demonstrated  by  Bishop  and 
Tipping  (2000).  However,  its  training  based  on  the  VB  updates  can  be  quite  slow.  We  apply  PX-VB 
to  address  this  issue. 

First,  we  expand  the  original  model  p(w,  6l,  f  |X,  t)  via 

w  =  w/r  (17) 

while  maintaining  a  and  f  unchanged.  Consequently,  the  data  likelihood  and  the  prior  on  w  become 

M 

pr(t|w,X,r)  =  J|A/'(fn|rwT^)„,r_1)  pr(w|a)  =  7V(wm|0, r_2a“1)  (18) 

n  m— 0 

Setting  r  =  ro  =  1,  we  update  q( r)  and  q(a)  as  in  the  regular  VB.  Then,  we  want  to  joint  optimize 
over  q( w)  and  r.  Instead  of  performing  a  fully  joint  optimization,  we  optimize  q( w)  and  r  separately 
at  the  same  time.  This  gives 

g+Vg2  +  16Mf 

4/ 

where  /  =  (r)  J2n  x«  (wwT)x»  +  Em(wm)(«m)  and  g  =  2 (r)  Em(wT)x«i«-  where  (wT)  and 
(wwT)  are  the  first  and  second  order  moments  of  the  previous  g(w).  Since  both  f  and  XT(w)  has 
been  computed  previously  in  VB  updates,  the  added  computational  cost  for  r  is  negligible  overall. 
The  separate  optimization  over  q( w)  and  r  often  decreases  the  KL  divergence.  But  it  cannot  guar¬ 
antee  to  achieve  a  smaller  KL  divergence  than  what  optimization  only  over  q( w)  would  achieves.  If 
the  regular  update  over  q( w)  achieves  a  smaller  KL  divergence,  we  reset  r  =  1. 

Given  r  and  q( w),  we  use  w  =  r w  to  reduce  the  expanded  model  to  the  original  one.  Cor¬ 
respondingly,  we  change  q( w)  =  Af(w\fj,w,  Em)  via  this  reduction  rule  to  obtain  qr( w)  = 

Af  (w\rp,w,  r2Su)). 

We  can  also  introduce  another  auxiliary  variable  s  such  that  a  =  a/s.  Similar  to  the  above  proce¬ 
dure,  we  optimize  over  s  the  expected  log  joint  probability  of  the  expanded  model,  and  at  the  same 
time  update  q(a).  Then  we  change  q(ot)  back  to  qs(a)  using  the  inverse  mapping  a  =  sot.  Due  to 
the  space  limitation,  we  skip  the  details  here. 

The  auxiliary  variables  r  and  s  change  the  individual  approximate  posteriors  q( w)  and  q(ot)  sep¬ 
arately.  We  can  combine  these  two  variables  into  one  and  use  it  to  adjust  q( w)  and  q(ot)  jointly. 
Specifically,  we  introduce  the  variable  c: 

w  =  w/c 


a  =  c2ot. 
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Figure  2:  Convergence  comparison  between  VB  and  PX-VB  for  ARD  regression  on  synthetic  data 
(a,b)  and  gene  expression  data  (c).  The  PX-VB  results  in  (a)  and  (c)  are  based  on  independent  aux- 
iliar  variables  on  w  and  a.  The  PX-VB  result  in  (b)  is  based  on  the  auxiliar  variable  that  correlates 
both  w  and  a.  The  added  computational  cost  for  PX-VB  in  each  iteraction  is  negligible  overall. 


Setting  c  =  Co  =  1,  we  perform  the  regular  updates  over  q(r),  q( w)  and  q(ot).  Then  we  optimize 
over  c  the  expected  log  joint  probablity  of  the  expanded  model.  We  cannot  find  a  closed-form  solu¬ 
tion  for  the  maximization.  But  we  can  efficiently  compute  its  gradient  and  Hessian.  Therefore,  we 
perform  a  few  steps  of  Newton  updates  to  partially  optimize  c.  Again,  the  additional  computational 
cost  for  calculating  c  is  small.  Then  using  the  inverse  mapping,  we  reduce  the  expanded  model  to 
the  original  one  and  adjust  both  q(w)  and  q(a)  accordingly.  Empirically,  this  approach  can  achieve 
faster  convergence  than  using  auxiliary  variables  on  q( w)  and  q(a)  separately.  This  is  demonstrated 
in  figure  2(a)  and  (b). 

We  compare  the  convergence  speed  of  VB  and  PX-VB  for  the  ARD  model  on  both  synthetic  data 
and  gene  expression  data.  The  synthetic  data  are  sampled  from  the  function  sine  (a?)  =  (sinx)  jx  for 
x  £  (—10, 10)  with  added  Gaussian  noise.  We  use  RBF  kernels  for  the  feature  expansion  <pn  with 
kernel  width  3.  VB  and  PX-VB  provide  basically  identical  predictions.  For  gene  expression  data, 
we  apply  ARD  to  analyze  the  relationship  between  binding  motifs  and  the  expression  of  their  target 
genes.  For  this  task,  we  use  3  order  polynomial  kernels. 

The  results  of  convergence  comparison  are  shown  in  figure  2.  With  a  little  modification  of  VB 
updates,  we  increase  the  convergence  speed  significantly.  Though  we  demonstrate  PX-VB  improve¬ 
ment  only  for  ARD  regression,  the  same  technique  can  be  used  to  speed  up  ARD  classification. 

4  Convergence  properties  of  VB  and  PX-VB 

In  this  section,  we  analyze  convergence  of  VB  and  PX-VB,  and  their  convergence  rates. 

Define  the  mapping  q(t+1)  ==  M( qW)  as  one  VB  update  of  all  the  approximate  distributions. 
Define  an  objective  function  as  the  unnormalized  KF  divergence: 

<2(q)=y  n  g»(x) iog  ) + (/ p(x)dx  ~  /  n^x)dx)-  (2°) 

It  is  easy  to  check  that  minimizing  Q(q)  gives  the  same  updates  as  VB  which  minimizes  KF  diver¬ 
gence. 

Based  on  Theorem  2.1  by  Fuo  and  Tseng  (1992),  an  iterative  application  of  this  mapping  to  minimize 
Q(q)  results  in  at  least  linear  convergence  to  an  element  q*  in  the  solution  set. 

Define  the  mapping  q^t+1)  =  Mx{ q^)  as  one  PX-VB  update  of  all  the  approximate  distribu¬ 
tions.  The  convergence  of  PX-VB  follows  from  similar  arguments,  i.e.,/3  =  [q 1  av\v  converges  to 
[q*TctJ]T,  where  a  €  A  are  the  expanded  model  parameters,  «0  are  the  null  value  in  the  original 
model. 

4.1  Convergence  rate  of  VB  and  PX-VB 

The  matrix  rate  of  convergence  DM (q): 

q(*+t)  -  q*  =  DM( q)T(q(t)  -  q*) 


(21) 


where  DM{ q)  = 

Define  the  global  rate  of  convergence  for  q:  r  =  Hindoo  ^qq(f)_~q  |  ^  ■ 

Under  certain  regularity  conditions,  r  =  the  largest  eigenvalue  of  DM(q) .  The  smaller  r  is, 
the  faster  the  algorithm  converges. 

Define  the  constraint  set  gs  as  the  constraints  for  the  sth  update.  Then  the  following  theorem  holds: 
Theorem  4.1  The  matrix  convergence  rate  for  VB  is: 

s 

DM{ q*)  =  P°  (22) 

S  —  1 

where  Ps  =  Bs[Bj(D2Q(c?))-1Bs]-1Bj(D2Q(q*)y1  andBs  =  V9s(q*). 

Proof:  Define  £  as  the  current  approximation  q.  Let  Gs(£)  be  qs  that  maximizes  the  objective 


function  Q(q)  under  the  constraint  gs(q)  =  gs(£)  = 

Let  Mo(q)  =  q  and 

Ms (q)  =  G's(Ms_1(q))  for  all  1  <  s  <  S.  (23) 

Then  by  construction  of  VB,  we  have  q(t+s/s)  =  Ms( qW),  s  =  1, . . . ,  S  and  DM( q*)  = 

DMs( q*).  At  the  stationary  points,  q*  =  DMS( q*)  for  all  s. 

We  differentiate  both  sides  of  equation  (23)  and  evaluate  them  at  q  =  q*: 

DMS(  q)  =  £>Ms_1(q)DGs(Mfl_1(  q*))  =  DMs^(q*)DGs(<l*)  (24) 

It  follows  that  DM (q*)  =  Ilf=i  DGs{ q*). 

To  calculate  DGs( q*),  we  differentiate  the  constraint  gs(Gs(£))  =  gs(£)  and  evaluate  both  sides 
at  £  =  q",  such  that 

DGS(  q*)Bs  =  Bs.  (25) 

Similarly,  we  differentiate  the  Lagrange  equation  DQS(G(£,))  ~  ^ 9s(G(£))\s(£)  =  0  and  evaluate 
both  sides  at  £  =  q*.  This  yields 

DG s(ci*) D2 Q s(q*)  -  DXs(q*)Bj  =  0  (26) 

Q 2 

Equation  (26)  holds  because  dqgq  =  0. 

Combining  (25)  and  (26)  yields 

DGS{  q*)  =  Bs[Bj  (Z)2Qs(q*))_1Bs]_1Bj  (D2Qs(  q*))^.D  (27) 


In  the  s  update  we  fix  q\s,  i.e.,  y/s (q)  =  q\  s.  Therefore,  Bs  is  an  identity  matrix  with  its  sth  column 
removed  B  s  =  I.  s,  where  I  is  the  identity  matrix  and  s, :  means  without  the  sth  column. 

Denote  Cs  =  (D2QS( q*))  1.  Without  the  loss  of  generality,  we  set  s  =  S.  It  is  easy  to  obtain 

B|GBS  =  C\S>\s  (28) 

where  \S,  \S  means  without  row  S  and  column  S. 


Inserting  (28)  into  (27)  yields 

Ps  =  DGs( q*)  =  (  Id0  1  C\S,\SQ°\S,S 


Id-1  -D2Q\s,s(D2Qs,s)-1 

0  0 


(29) 


where  ld-i  is  a  (d  —  1)  by  (d  —  1)  identity  matrix,  and  D2Q\g  s  =  d  and  D2Qs,s  = 

9  Notice  that  we  use  Schur  complements  to  obtain  (29).  Similar  to  the  calculation  of 

Ps  via  (29),  we  can  derive  Ps  for  s  =  1, . . . ,  S  —  1  with  structures  similar  to  If. 


(30) 


The  above  results  help  us  understand  the  convergence  speed  of  VB.  For  example,  we  have 

q(t+1)  -  q  *  =  pj...p?  (qW  -  q*). 

For  qs,  q£+1)  -  qj  =  (  -  (D2Qs^)-1D2QSAg  0)  (qC*+(®-i)/S)  -  q*). 

Clearly,  if  we  view  D2Qg\g  as  the  correlation  between  qg  and  q\g,  then  the  smaller  “correlation”, 
the  faster  the  convergence.  In  the  extreme  case,  if  there  is  no  correlation  between  qg  and  q  g,  then 

q^-1  ' '  —  q^  =  0  after  the  first  iteration.  Since  the  global  convergence  rate  is  bounded  by  the 
maximal  component  convergence  rate  and  generally  there  are  many  components  with  convergence 
rate  same  as  the  global  rate.  Therefore,  the  instant  convergence  of  qg  could  help  increase  the  global 
convergence  rate. 

For  PX-VB,  we  can  compute  the  matrix  rate  of  convergence  similarly.  In  the  toy  example  in 
Section  2,  PX-VB  introduces  an  auxiliary  variable  a  which  has  zero  correlation  with  w,  lead¬ 
ing  an  instant  convergence  of  the  algorithm.  This  suggests  that  PX-VB  improves  the  conver¬ 
gence  by  reducing  the  correlation  among  { qs } .  Rigorously  speaking,  the  reduction  step  in  PX- 
VB  implictly  defines  a  mapping  between  q  to  qao  through  the  auxiliary  variables  a:  (q.  pao )  — » 
(q,pa)  — >  (qa,P«0)-  Denote  this  mapping  as  Ma  such  as  qa  =  Ma( q).  Then  we  have 
DMx{ q*)  =  DGi(q*)  ■  ■  ■  DGa( q*)  •  •  •  DGg( q*) 

It  is  known  that  the  spectral  norm  has  the  following  submultiplicative  property  ||£iF||  <=  ||£,||  ||F||, 
where  E  and  F  are  two  matrices.  Thus,  as  long  as  the  largest  eigenvalue  of  Ma  is  smaller  than  1, 
PX-VB  converges  faster  than  VB.  The  choice  of  a  affects  the  convergence  rate  by  controlling  the 
eigenvalue  of  this  mapping.  The  smaller  the  largest  eigenvalue  of  Ma,  the  faster  PX-VB  converges. 
In  practice,  we  can  check  this  eigenvalue  to  make  sure  the  constructed  PX-VB  algorithm  enjoys  a 
fast  convergence  rate. 

5  Discussion 

We  have  provided  a  general  approach  to  speeding  up  convergence  of  variational  Bayesian  learning. 
Faster  convergence  is  guaranteed  theoretically  provided  that  the  Jacobian  of  the  transformation  from 
auxiliary  parameters  to  variational  components  has  spectral  norm  bounded  by  one.  This  property 
can  be  verified  in  each  case  separately.  Our  empirical  results  show  that  the  performance  gain  due  to 
the  auxiliary  method  is  substantial. 
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